Summary


The purpose of this expository paper is to furnish a simple
introduction to the use of the theory of dynamic programming
in treating multi-stage decision processes.
Some Aspects of the Theory of Dynamic Programming

By 

Richard  Bellman 

§1. Introduction.

In recent years, a number of mathematical problems of
novel and unconventional type have arisen from the study of
economic, engineering, industrial, and military fields to
challenge the mathematician. A particular class of these
problems consists of those we may call "decision processes". These
involve the planning, scheduling, or programming, all
equivalent terms, of sequences of operations.

Many new techniques have been devised to solve these
problems, and the very concept of a solution has been altered
as a consequence of the availability of modern computing
machines. The purpose of this paper is to present some of
the basic concepts of one approach to these problems, the theory
of dynamic programming.

We shall illustrate this approach by considering two
simple examples, one a maximization problem of conventional
type, and the other a decision process involving random or
chance events. After discussing both of these problems, we
shall attempt to synthesize our results and abstract the
common elements.

§2. The Arithmetic Mean-Geometric Mean Inequality.

Probably the most well-proved inequality in analysis
is the arithmetic-geometric mean inequality, which asserts that

(1)  \frac{1}{n} \sum_{i=1}^{n} x_i \ge \sqrt[n]{x_1 x_2 \cdots x_n},

for any n non-negative quantities x_1, x_2, ..., x_n.

An equivalent form of the inequality is

(2)  \max_{R} x_1 x_2 \cdots x_n = (1/n)^n,

where R is the region defined by

(3)  a.  x_i \ge 0, \quad i = 1, 2, \ldots, n,

     b.  \sum_{i=1}^{n} x_i = 1.

One may ask: What does an n-dimensional maximization
problem have to do with programming a sequence of operations?

The answer resides in the fact that the choice of a point in
n dimensions, (x_1, x_2, ..., x_n), may be considered to be a single
operation, namely the choice of one point, or as a sequence
of operations, requiring first the choice of x_1, then the choice
of x_2, and so on. It is clear that there should be some
analytic and computational advantages derived from replacing
an n-dimensional operation by a sequence of n one-dimensional
operations.

We begin with the observation that the maximum of x_1 x_2 ... x_n
over the region R is a numerical quantity depending only upon
n, the dimension. Let us then define

(4)  u_n = \max_{R} x_1 x_2 \cdots x_n

for n = 1, 2, 3, ... Clearly u_1 = 1.

If we choose x_1 as fixed, and consider what the other x_i
must be, we see that to achieve the maximum value, u_n, we
must choose x_2, x_3, ..., x_n so as to maximize the remaining
product x_2 x_3 ... x_n subject to the constraints

(5)  a.  x_i \ge 0, \quad i = 2, \ldots, n,
     b.  x_2 + x_3 + \cdots + x_n = 1 - x_1.

Now this is a problem quite similar to the original, except
for two facts. The dimension is now (n-1) and the sum in
(5b) is 1 - x_1, in place of 1, as in (3b).

Let us then generalize our original problem by considering
the problem of maximizing P(x) = x_1 x_2 ... x_n subject
to the constraints

(6)  R:  a.  x_i \ge 0,

         b.  \sum_{i=1}^{n} x_i = a > 0.

The maximum of P(x) will now be a function of n and a. Define
the new function

(7)  f_n(a) = \max_{R} P(x),

for n = 1, 2, ...

For any choice of x_1 in the range 0 \le x_1 \le a, we see that
x_2, x_3, ..., x_n must be chosen so as to maximize x_2 x_3 ... x_n
subject to the constraints

(8)  a.  x_i \ge 0, \quad i = 2, 3, \ldots, n,

     b.  x_2 + x_3 + \cdots + x_n = a - x_1.

It is clear that x_1 \ne 0 or a. By definition of the functions
{f_n(a)}, we have

(9)  \max x_2 x_3 \cdots x_n = f_{n-1}(a - x_1).

Thus, for any choice of x_1 in [0, a], we have

(10)  x_1 x_2 \cdots x_n = x_1 f_{n-1}(a - x_1),

if we are trying to maximize. Since x_1 itself is also to be
chosen to maximize the final result, we obtain the basic
recurrence relation

(11)  f_n(a) = \max_{0 \le x_1 \le a} [x_1 f_{n-1}(a - x_1)].

For practical purposes, the problem would now be solved,
since we have reduced the determination of f_n(a) to the
computation of a sequence of functions of one variable, namely

(12)  f_2(a) = \max_{0 \le x_1 \le a} [x_1 (a - x_1)],

      f_3(a) = \max_{0 \le x_1 \le a} [x_1 f_2(a - x_1)],

and so on.
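
As an illustration only, the recurrence (11) can be carried out numerically on a grid of values of a. The following Python sketch is not part of the original paper; the function name max_product_tables, the grid size m, and the restriction of x_1 to grid points are assumptions made for the example. Starting from f_1(a) = a, it tabulates f_2, f_3, ... in turn.

```python
def max_product_tables(n_max, a=1.0, m=60):
    """Tabulate f_n on the grid 0, h, 2h, ..., a using recurrence (11).

    Assumption: x_1 is restricted to grid points, so each entry is the
    maximum over grid choices only (exact when a/n lies on the grid).
    """
    h = a / m
    tables = {1: [k * h for k in range(m + 1)]}  # f_1(a) = a
    for n in range(2, n_max + 1):
        prev = tables[n - 1]
        # f_n(kh) = max over x_1 = jh of x_1 * f_{n-1}(kh - x_1)
        tables[n] = [
            max(j * h * prev[k - j] for j in range(k + 1))
            for k in range(m + 1)
        ]
    return tables


if __name__ == "__main__":
    tables = max_product_tables(4)
    for n in range(1, 5):
        # last table entry is f_n(a); compare with (a/n)^n for a = 1
        print(n, tables[n][-1], (1.0 / n) ** n)
```

With m = 60 the points a/2, a/3 and a/4 fall on the grid, so the last entries of the tables should agree with (1/n)^n up to rounding.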

If  we  insist  upon  the  luxury  of  an  explicit  solution  we 
may  proceed  as  follows.  From  the  homogeneity  of  all  the 
relations,  we  see  that 
(13)  f_n(a) = a^n f_n(1) = a^n u_n.

Hence (11) becomes

(14)  a^n u_n = \max_{0 \le x_1 \le a} x_1 (a - x_1)^{n-1} u_{n-1},

or

(15)  u_n = [\max_{0 \le y \le 1} y (1 - y)^{n-1}] u_{n-1} = \frac{(n-1)^{n-1}}{n^n} u_{n-1}.

Starting with u_1 = 1, we obtain (2) inductively.
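
The induction can be checked mechanically. The short Python sketch below is an illustration, not part of the original paper, and the name u_values is hypothetical. It iterates (15) in exact rational arithmetic, using the fact that y(1-y)^{n-1} attains its maximum on [0, 1] at y = 1/n, and the resulting values can be compared with (1/n)^n.

```python
from fractions import Fraction


def u_values(n_max):
    """Return [u_1, ..., u_{n_max}] from recurrence (15), with u_1 = 1."""
    u = [Fraction(1)]
    for n in range(2, n_max + 1):
        # Maximum of y(1-y)^(n-1) on [0, 1], attained at y = 1/n.
        factor = Fraction((n - 1) ** (n - 1), n ** n)
        u.append(factor * u[-1])
    return u


if __name__ == "__main__":
    for n, u_n in enumerate(u_values(5), start=1):
        print(n, u_n, u_n == Fraction(1, n ** n))
```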

§3. A Gold-Mining Problem.

Let  us  consider  a  problem  which  can  more  legitimately 
be  considered  a  programming  problem. 

Suppose that we have two gold mines, Anaconda and Bonanza,
and a rather delicate gold-mining machine. The properties of
the machine are such that if used to mine either Anaconda or
Bonanza it will bring to the surface a certain fraction of the
gold in the mine, and remain undamaged, awaiting further use,
or it will be irretrievably damaged and mine no gold
thereafter. More precisely, we assume that if the machine is used
in Anaconda, there is a probability P_1 that a fraction r_1
of the gold there will be mined and the machine remain
undamaged, and a probability Q_1 that the machine will mine nothing
and be damaged beyond repair. Similarly, if Bonanza is mined,
the probabilities are P_2 and Q_2 respectively, and the fraction
is r_2.

The process is now the following. At the first stage we
choose to use our machine in either Anaconda or Bonanza. If
the machine is undamaged, we make a similar choice at the
second stage, and so on, until the machine is damaged, at which
time the process terminates. Given the initial quantities of
gold in the mines, say x in Anaconda and y in Bonanza, the
problem is to choose the sequence of operations which
maximizes the expected amount of gold that is mined before the
machine is defunct.

It is clear that we cannot speak of maximizing the amount
of gold mined because of the probabilistic nature of the process,
but that we must content ourselves with some average measure
of the return.

Let us begin by observing that the expected return from an
optimal sequence of choices depends only upon x and y, the
initial quantities in each mine. With this in mind, we define

(1)  f(x, y) = expected return obtained using an optimal
               sequence of choices when Anaconda has x
               and Bonanza has y.

We shall solve our problem by obtaining a recurrence relation
for f(x, y). This we do in the following way. Suppose we
choose to mine Anaconda. If the machine is undamaged, we
obtain r_1 x and are left in a situation where Anaconda possesses
(1 - r_1)x and Bonanza possesses y. In this situation we will
proceed to make an optimal sequence of choices, and hence, by
definition, obtain a further expected return of f((1 - r_1)x, y).
Since the probability that the machine is undamaged is P_1,
we see that the expected return from an initial choice of
Anaconda is


(2)  f_A(x, y) = P_1 (r_1 x + f((1 - r_1) x, y)).

Similarly,  the  expected  return  from  an  initial  choice  of 
Bonanza  is 


(3)  f_B(x, y) = P_2 (r_2 y + f(x, (1 - r_2) y)).

Since we want to choose the mine which maximizes the total
return, the final equation for f(x, y) is

(4)  f(x, y) = \max ( f_A(x, y), f_B(x, y) )

             = \max \{ A: P_1 (r_1 x + f((1 - r_1) x, y)), \; B: P_2 (r_2 y + f(x, (1 - r_2) y)) \}.


We have thus reduced the original problem to the analytic
problem of solving this unconventional functional equation.
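
Before turning to the explicit solution, it may help to see that equation (4) can also be handled numerically. The Python sketch below is an illustration and not part of the original paper. It assumes the process is cut off after a fixed number of uses of the machine, which neglects a tail of at most max(P_1, P_2)^stages (x + y); the name gold_value and the parameter stages are hypothetical.

```python
from functools import lru_cache


def gold_value(x, y, p1, r1, p2, r2, stages=80):
    """Approximate f(x, y) of (4), stopping after `stages` uses of the machine.

    Assumption: the infinite process is truncated at `stages` stages.
    Returns (value, first_choice), where first_choice is 'A' or 'B'.
    """

    @lru_cache(maxsize=None)
    def f(i, j):
        # State after i successful uses of Anaconda and j of Bonanza.
        if i + j == stages:
            return 0.0, None
        xa = x * (1 - r1) ** i   # gold left in Anaconda
        yb = y * (1 - r2) ** j   # gold left in Bonanza
        f_a = p1 * (r1 * xa + f(i + 1, j)[0])   # equation (2)
        f_b = p2 * (r2 * yb + f(i, j + 1)[0])   # equation (3)
        return (f_a, "A") if f_a >= f_b else (f_b, "B")

    return f(0, 0)


if __name__ == "__main__":
    print(gold_value(1.0, 1.0, 0.8, 0.2, 0.6, 0.5))
```

When Bonanza is empty the machine should always be sent to Anaconda, and the value reduces to the geometric series P_1 r_1 x / (1 - P_1 (1 - r_1)), which gives a convenient check on the sketch.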

In this case also, we can solve explicitly. It can be shown that
the rule which determines the choice of Anaconda or Bonanza
is the following:




P-696 

6-83-55 


(5)  a.  If \frac{P_1 r_1 x}{1 - P_1} > \frac{P_2 r_2 y}{1 - P_2}, choose Anaconda,

     b.  if \frac{P_1 r_1 x}{1 - P_1} < \frac{P_2 r_2 y}{1 - P_2}, choose Bonanza,

     c.  if \frac{P_1 r_1 x}{1 - P_1} = \frac{P_2 r_2 y}{1 - P_2}, choose either.

Observe that what we call the "solution" is not an
analytic expression for f(x, y), which is relatively unimportant,
but a rule for carrying out the optimal sequence of choices,
which is, after all, what we wanted. The functional equation
merely serves as an intermediary for the determination of the
optimal sequence of choices.
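
To make the idea of a rule concrete, the following Python sketch (an illustration, not part of the original paper) computes the expected return obtained by following a given rule stage by stage. It reads the denominators in (5) as 1 - P_1 and 1 - P_2, which is what comparing the order "Anaconda then Bonanza" with "Bonanza then Anaconda" gives; that reading, the cut-off after a fixed number of stages, and the names expected_return and index_rule are assumptions of the example.

```python
def expected_return(x, y, p1, r1, p2, r2, choose, stages=80):
    """Expected gold mined when choose(x, y) picks 'A' or 'B' at each stage.

    Assumption: the process is cut off after `stages` stages.
    """
    total, alive = 0.0, 1.0   # alive = probability the machine still works
    for _ in range(stages):
        if choose(x, y) == "A":
            alive *= p1
            total += alive * r1 * x
            x *= 1 - r1
        else:
            alive *= p2
            total += alive * r2 * y
            y *= 1 - r2
    return total


def index_rule(p1, r1, p2, r2):
    """Rule (5), read with denominators 1 - P_1 and 1 - P_2 (requires P < 1)."""
    def choose(x, y):
        a = p1 * r1 * x / (1 - p1)
        b = p2 * r2 * y / (1 - p2)
        return "A" if a >= b else "B"
    return choose


if __name__ == "__main__":
    params = (0.8, 0.2, 0.6, 0.5)
    print(expected_return(1.0, 1.0, *params, index_rule(*params)))
    print(expected_return(1.0, 1.0, *params, lambda x, y: "A"))
    print(expected_return(1.0, 1.0, *params, lambda x, y: "B"))
```

Passing the rule in as a function makes it easy to compare rule (5) with simpler policies, such as always mining the same mine.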

§4. Analysis.


Let us now see if we can extract the common features of
these quite dissimilar problems. In so doing we shall be able
to recognize other problems which may be treated by the same
techniques.

In each problem the status of the process is described
by a small number of parameters. In the maximization problem
these were the number of variables remaining to be chosen, and
their sum a; in the gold-mining problem these were the
quantities of gold available in the two mines at the
beginning of any stage. Furthermore, at each stage of either
process, we had a choice of decisions. In the maximization
problem, we had a choice of x_1 between 0 and a; in the
gold-mining problem, we had a choice of a mine, essentially a choice
of A or B. The effect of these decisions was to transform
the descriptive parameters into another similar set. In the
maximization problem, we reduce a to a - x_1 by a choice of x_1
and reduce the number of stages remaining; in the gold-mining
problem, we alter the amount of gold in one mine or the other,
if the machine is undamaged.

The essential feature of these problems is a certain
invariance, in the sense that at each stage we are confronted
by a situation of the same general type. Only processes
possessing this symmetry over time can be treated by the
techniques of the theory of dynamic programming. For processes
escaping these methods, there remain the computational
algorithms of linear and nonlinear programming, Monte Carlo
techniques, and, occasionally, for want of better, the brute
force of computing machines.

§5. Abstraction.

We  can  describe  this  invariance  in  the  following  way. 

We have an abstract system, S, whose state is characterized
at any time by the vector P = (p_1, p_2, ..., p_n). At each stage
of the process, we have a choice of a number of decisions, or
transformations, which convert P into a new vector T(P, q),
where q symbolizes the decision we make. The purpose of the
multi-stage process is to maximize some prescribed function
of the final state, with the problem that of determining the
sequence of decisions which produces the maximum.

Let  us  call  any  sequence  of  allowable  decisions  a  policy, 
and  a  policy  that  yields  the  maximum  an  optimal  policy. 

The  fundamental  aim  of  the  theory  is  the  determination  of 
the  structure  of  all  optimal  policies.  In  order  to  obtain  an 
analytic  hold  on  the  problem  we  introduce  the  return  function 
f(P),  the  total  return  obtained  using  an  optimal  policy 
starting  from  the  state  P. 

The recurrence relations obtained in the previous sections,
(3.4) and (2.11), are both consequences of the following
intuitive and plausible principle:

Principle  of  Optimality.  An  optimal  policy  has  the  property 
that  whatever  the  first  decision,  the  remaining  decisions 
must  constitute  an  optimal  policy  for  the  state  resulting  from 
the first decision.

Using this principle, the functional equation governing
the general process described above is

(1)  f(P) = \max_{q} f(T(P, q)).
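
As a sketch of how (1) may be used computationally, suppose the process has a fixed finite number of stages; this is an assumption of the example, since equation (1) itself carries no stage index. The Python code below is an illustration and not part of the original paper, and the names optimal_return, decisions, transform and final_return are hypothetical. It is applied to an integer analogue of the problem of §2: dividing the integer 10 into three non-negative integer parts so as to maximize their product.

```python
def optimal_return(state, stages, decisions, transform, final_return):
    """Return (f(P), an optimal policy) with `stages` decisions remaining."""
    if stages == 0:
        return final_return(state), []
    best_value, best_policy = None, None
    for q in decisions(state):
        # Principle of optimality: continue optimally from T(P, q).
        value, policy = optimal_return(
            transform(state, q), stages - 1, decisions, transform, final_return
        )
        if best_value is None or value > best_value:
            best_value, best_policy = value, [q] + policy
    return best_value, best_policy


# State P = (amount still to be allotted, product of the parts chosen so far).
def decisions(state):
    remaining, _ = state
    return range(remaining + 1)


def transform(state, q):
    remaining, product = state
    return remaining - q, product * q


def final_return(state):
    remaining, product = state
    return product if remaining == 0 else 0   # the parts must add up exactly


if __name__ == "__main__":
    print(optimal_return((10, 1), 3, decisions, transform, final_return))
```

The recursion enumerates every policy, so it is practical only for small examples; its purpose is to show the roles of the state P, the decision q and the transformation T(P, q).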

Applications  of  this  technique  have  been  made  to  various 
parts  of  mathematical  economics,  to  the  theory  of  control 
processes,  and  to  such  fields  of  mathematics  as  the  calculus 
of  variations  and  the  theory  of  differential  equations. 
Those interested in further details may consult the following
two articles:

1.  "The  Theory  of  Dynamic  Programming",  Bull.  Amer.  Math. 
Soc., Vol. 60 (1954), pp. 503-516.

2.  "Dynamic  Programming — A  Survey",  Jour.  Operations  Research 
Society, Vol. 2 (1954), pp. 27^-209.

where  many  further  references  will  be  found. 


